Gradient compression for distributed training

ABSTRACT

Techniques for exchanging compressed gradient data within a distributed system are disclosed. A set of gradients are computed at a first worker node of the distributed system using a neural network model and a set of weights associated with the neural network model. Each of the set of gradients having a value less than a threshold is clipped, resulting in non-clipped data elements and clipped data elements. A mapping indicating which of the set of gradients correspond to non-clipped data elements and which of the set of gradients correspond to clipped data elements is generated. Compressed data is generated based on the non-clipped data elements. The mapping and the compressed data are transmitted from the first worker node to a second worker node of the distributed system.

BACKGROUND

Artificial neural networks, which are often simply referred to as neural networks, are computing systems with architectures based on biological neural networks. Neural networks can be trained using training data to learn how to perform certain tasks, such as identifying or classifying physical objects, activities, characters, etc., from images or videos. A neural network may include multiple layers of processing nodes. Each processing node in a layer can perform computations on input data generated by processing nodes in the preceding layer to generate output data. For example, a processing node may perform a set of arithmetic operations such as multiplications and additions to generate an intermediate output, or perform post-processing operations on the intermediate output to generate a final output. A neural network may include thousands or more of processing nodes and millions or more of parameters.

The architecture of a neural network may include an input layer, an output layer, and a number of intermediate layers, often referred to as hidden layers. Each layer executes a computation on the outputs of the previous layer, with the last layer (the output layer) providing a final result. With more layers, a neural network can, theoretically, perform more complex tasks, such as language translations and identifying (or classifying) the contents of an image. A neural network with more than three hidden layers is sometimes referred to as a deep neural network. Deep neural networks can have many hidden layers, such as, for example, between five and more than a thousand layers.

Neural networks can be implemented using a central processing unit (CPU) to perform the computations. CPUs, however, tend to be optimized for sequential rather than parallel computations, and thus can suffer from poor response times. Graphics processing units (GPUs) are optimized for parallel computations, but not necessarily for the result from one computation unit to be provided directly to another computation unit. Often, the result must first be written to a memory and then read back. Although GPUs can have better response times than CPUs, it would still be desirable to improve the execution time of a neural network. Recently, special-purpose integrated circuit devices, such as neural network processors or accelerators, have been developed to execute neural networks more efficiently than either CPUs or GPUs. These devices include spatial architectures in which arithmetic logic units (ALUs) can pass data from one to another directly, in contrast to the temporal architectures employed by CPUs and GPUs in which ALUs can only fetch data from the memory hierarchy but cannot communicate directly with each other.

When a neural network is trained to perform a particular function, the parameters of the neural network (e.g., its weights, which represent the strength of connections between different processing nodes) are adjusted over multiple iterations. The training process involves supplying the neural network with training data, which can include training input data and corresponding reference output data which can support a particular decision (e.g., a detection or a non-detection of an object in an image). The neural network can perform computations to combine the weights with the training input data to generate training output data, and the training output data can be compared against the reference output data to generate error data (representing the differences between the two). During the training, different training input data can be provided to the neural network to generate different training output data. The weights of the neural network can be adjusted based on an objective such as, for example, minimizing the differences between the training output data and the reference output data. To improve the likelihood of the neural network generating a correct decision, typically a large volume of training input data covering a large number of operation scenarios is used to train a neural network. As a result, a training operation typically requires significant time and computation resources.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 illustrates an example of a computational flow model for a neural network;

FIG. 2 illustrates an example of a training process to train a neural network;

FIG. 3 illustrates an example distributed system that can perform a training process for a neural network;

FIGS. 4A-4C illustrate various example training steps performed by a distributed system;

FIGS. 5A-5C illustrate various example training steps performed by a distributed system;

FIGS. 6A and 6B illustrate example timing diagrams corresponding to FIGS. 4A-4C and FIGS. 5A-5C, respectively;

FIG. 7 illustrates an example timing diagram for a neural network model training;

FIG. 8 illustrates an example communication of a set of gradients between a transmitting worker node and a receiving worker node;

FIG. 9 illustrates an example diagram showing steps for exchanging compressed gradient data within a distributed system;

FIG. 10 illustrates an example timing diagram for transmitting compressed gradient data within a distributed system;

FIGS. 11A and 11B illustrate a method of exchanging compressed gradient data within a distributed system;

FIG. 12 illustrates an example of an accelerator;

FIG. 13 illustrates an example of an acceleration engine;

FIG. 14 illustrates an example of a host system; and

FIG. 15 illustrates an example network.

DETAILED DESCRIPTION

During training of a neural network, a first neural network layer can receive training input data, combine the training input data with the weights (e.g., by multiplying the training input data with the weights and then summing the products) to generate first output data for the neural network layer, and propagate the output data to a second neural network layer, in a forward propagation operation. The second neural network layer performs another forward propagation operation on the first output data from the first neural network layer to generate second output data, and propagates the second output data to higher neural network layers. The forward propagation operations can start at the first neural network layer and end at the highest neural network layer. The forward propagation operation at each neural network layer can represent different stages of extraction and processing of information from the training input data. A decision can then be made based on the output data of the highest neural network layer.

The set of weights of the neural network can be generated and/or updated by the training process to improve the likelihood of the neural network generating a correct decision. An example training process can use a gradient descent scheme. Specifically, as part of the training process, forward propagation operations can be performed on training input data, using the set of weights at each neural network layer, to generate training output data at the highest level neural network layer. The training output data can be compared with reference output data that supports a particular decision. A set of gradients can be generated based on, for example, differences between the training output data and the reference output data.

As part of the training process, each neural network layer can then perform a backward propagation process to adjust the set of weights at each neural network layer. Specifically, the highest neural network layer can receive the set of gradients and compute, in a backward propagation operation, a set of first data gradients and a set of first weight gradients based on applying the set of weights to the input data gradients in similar mathematical operations as the forward propagation operation. The highest neural network layer can adjust the set of weights of the layer based on the set of first weight gradients, whereas the set of first data gradients can be propagated to the second highest neural network layer to influence the adjustment of the set of weights of the previous neural network layer. The backward propagation operations can start from the highest neural network layer and end at the first neural network layer. The set of weights at each neural network layer can be adjusted, to complete one iteration of the training process. The training process can be repeated for the same training data for a number of iterations until a loss objective (e.g., a threshold input data gradient) is achieved.

A training process is typically very time-consuming due to the sequential nature and data dependency among the operations involved in the training process. Specifically, in a training process, a forward propagation operation is first performed at each neural network layer to compute output data, then input data gradients are computed based on the output data (and reference output data), then a backward propagation operation is performed at each neural network layer to compute the weight gradients, which is then followed by the updating of the weights at each neural network layer. As the backward propagation operations depend on the forward propagation operations, the two sets of operations may not be performed in parallel. Moreover, due to data dependency among the neural network layers, the forward propagation operations and the backward propagation operations also need to be performed sequentially for each neural network layer. The lack of parallelism can drastically increase the training time, which is further increased when multiple iterations of the training process on the same training input data are performed to achieve the loss objective. Moreover, the training process typically involves supplying the neural network with multiple sets of training data to cover different operation conditions, such that the neural network can be trained to provide a correct decision under those different operation conditions. The computing system that implements the neural network may need to perform additional training processes to process the additional training input data, which will further increase the training time. Coupled with the fact that the training process typically requires a higher precision than the inference operation, a slow training process can put a lot of stress on the computation resources.

A distributed system can accelerate a training process by distributing the training process across multiple computing systems, which can be referred to as worker nodes. Training data can be split into multiple portions, with each portion to be processed by a worker node. Each worker node can perform the forward and backward propagation operations independently, and in parallel with each other, based on a portion of the training input data, to generate a set of weight gradients for each neural network layer. Each worker node can exchange its set of weight gradients with other worker nodes, and average its set of weight gradients and the sets of weight gradients received from other worker nodes. Each worker node can then have the same set of averaged weight gradients, and can update a set of weights for each neural network layer based on the averaged weight gradients.

Distributing the training process across multiple worker nodes can reduce the amount of training input data to be processed at each worker node, which can reduce the execution time of the forward and backward propagation operations at each neural network layer and accelerate the training process. However, as distributed learning is typically implemented over a relatively low speed network, the exchange of weight gradients among the worker nodes can introduce a substantial bottleneck. For example, in a case where the distributed system is in a cloud infrastructure and worker nodes exchange weight gradients with each other by sending network packets, the network latency can be substantial relative to the execution times of the forward/backward propagation operations. The network latency can diminish the reduction in the training time brought about by the distributed system, or even increase the training time. Accordingly, techniques for reducing network latency in distributed training are needed.

Embodiments described herein relate to systems, methods, and other techniques for exchanging compressed gradient data within a distributed system. The described embodiments reduce the gradient exchange throughput requirement between nodes by clipping portions of the computed gradients and performing a compression technique on the remaining data prior to transmission to the remote node. While some embodiments entail a small increase in overhead by way of a compression header that accompanies the compressed data, this comes with a significant reduction in the size of the compressed data.

Some embodiments provide for communication of weight gradients (or simply “gradients”) between a transmitting worker node and a receiving worker node of a distributed system. The transmitting worker node may compute a set of gradients using a neural network model that are to be transmitted for synchronization at the receiving worker node. Prior to transmission, a sparsity analysis may be performed on the computed gradients at the transmitting worker node. The sparsity analysis may determine various statistics regarding the gradients, such as a mean or standard deviation. The sparsity analysis may identify different gradient values at which different sparse percentages can be achieved by clipping the gradients at those values. For example, a clipping threshold T may be set at a 50% sparse percentage so that 50% of the gradients lie below the threshold. The sparse percentage may be based on the number of gradients in each layer so that the compression times and/or transmission times can be better aligned or reduced for all layers.

After the clipping threshold T is determined, clipping may be performed on the gradients by clipping each gradient having a value less than the clipping threshold T (by, for example, setting its value to zero). This can result in the set of gradients including non-clipped data elements and clipped data elements. Next, compressed data representing the non-clipped data elements is generated along with a mapping that identifies the locations of the non-clipped data elements within the set of gradients. The mapping allows the receiving worker node to decompress and reconstruct the set of gradients by combining the non-clipped data elements from the compressed data with the clipped data elements (e.g., zeros) using the mapping.
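By way of illustration only, the clipping and compression stages described above can be sketched as follows in Python with NumPy. The array values, the threshold, and the function name are illustrative assumptions rather than the claimed implementation, and the bitmap is kept as a boolean array rather than a packed bit field for readability.

```python
import numpy as np

def clip_and_compress(gradients: np.ndarray, threshold: float):
    """Clip gradients below the threshold and keep only the survivors.

    Returns (mapping, compressed): `mapping` is a bitmap with True at the
    positions of non-clipped data elements, and `compressed` holds only the
    non-clipped values in their original order.
    """
    flat = gradients.ravel()
    mapping = flat >= threshold   # True -> non-clipped, False -> clipped
    compressed = flat[mapping]    # non-clipped data elements only
    return mapping, compressed

# Example: 8 gradients, with a threshold that clips 5 of them.
grads = np.array([0.9, 0.1, 0.2, 0.05, 0.8, 0.15, 0.1, 0.7], dtype=np.float16)
mapping, compressed = clip_and_compress(grads, threshold=0.5)
print(mapping.astype(int))   # [1 0 0 0 1 0 0 1]
print(compressed)            # [0.9 0.8 0.7]
```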

Various benefits are achieved by way of the described techniques. For example, the described techniques provide better compression rates compared to conventional compression algorithms for almost all sparse percentages. For example, as demonstrated by experimental data described herein, the described techniques provide compressed data that is smaller than or close to the original data size for the lowest sparse percentage cases and significant compression rates for the higher sparse percentage cases. The described techniques provide for less complex hardware implementations, as there is no need to track the row/column index and pointer. Furthermore, the compression header that may be calculated has a predictable length, which leads to more straightforward sliding-window buffer management. In addition to inter-node communication, the described compression techniques can be used when data is transferred within a worker node (e.g., from a local memory to a compute engine).

In the following description, various examples will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the examples. However, it will also be apparent to one skilled in the art that the examples may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiments being described.

FIG. 1 illustrates an example of a computational flow model for a neural network 100. Neural networks take inspiration from the mechanics of the operation of the human brain. According to various models of the brain, the main computational element of the brain is the neuron. Neurons are connected together with a number of elements, with elements entering a neuron being referred to as dendrites and an element leaving a neuron being referred to as an axon. A neuron accepts signals via dendrites, performs a computation on the signals, and outputs a signal on an axon. The input and output signals are referred to as activations. The axon of one neuron can branch out and be connected to the dendrites of multiple neurons. The connection between a branch of an axon and a dendrite is called a synapse.

A synapse can scale the signal crossing the synapse. The scaling factor is referred to as a weight, and is thought of as the way a brain is able to learn: different weights result from different responses to input. Learning can change the weights, but the organization of the neurons and synapses need not change to obtain the learning. The static structure of the brain can thus be used as a model for a program, and the weights can reflect tasks that the program has learned to perform.

Neural networks operate on the notion that a neuron's computation involves a weighted sum of input values. These weighted sums correspond to the value scaling performed by the synapses and the combining of those values in the neuron. A functional operation is performed in the neuron on the combined inputs. In the brain model, the operation appears to be a non-linear function that causes the neuron to generate an output only when the inputs cross some threshold. Thus, by analogy, the nodes of a neural network can apply a non-linear function to the weighted sum of the values input into the nodes.

In the illustrated example, the neural network 100 includes an input layer 104, one or more middle layers that are often referred to as hidden layers 106, and an output layer 108. Each layer includes some number of nodes 102. In this example, each node 102 of the input layer 104 is connected to each node 102 of the hidden layer 106-1. The connections, which would be referred to as synapses in the brain model, are referred to as weights 110. Also in this example, each node 102 of the hidden layer 106-N has a connection or weight 110 with each node 102 of the output layer 108. The input layer 104 can receive inputs and can propagate the inputs to the hidden layer 106-1. Weighted sums computed by the hidden layer 106-1 are propagated to the remaining hidden layers 106 and subsequently to the output layer 108, which can present final outputs to a user. The outputs of the nodes 102 can be referred to as activations, in keeping with the brain model.

An example of a computation that can occur at each layer in the example neural network 100 is as follows:

$y_{j} = f\left( \sum_{i = 1}^{3} W_{ij} \times x_{i} + b \right)$

In the above equation, $W_{ij}$ is a weight, $x_{i}$ is an input activation, $y_{j}$ is an output activation, $f(\cdot)$ is a non-linear function, and $b$ is a bias term. Various non-linear functions can be used to achieve different purposes.
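For illustration, a single output activation can be computed as follows; the three input values, the weights, the bias, and the choice of a ReLU as the non-linear function are assumptions made only for this example.

```python
import numpy as np

def node_output(x, w, b):
    """Compute y_j = f(sum_i W_ij * x_i + b), using ReLU as the function f."""
    weighted_sum = np.dot(w, x) + b
    return np.maximum(weighted_sum, 0.0)  # non-linear function applied to the sum

x = np.array([0.5, -1.0, 2.0])  # input activations x_1..x_3
w = np.array([0.2,  0.4, 0.1])  # weights W_1j..W_3j feeding output node j
b = 0.05                        # bias term

print(node_output(x, w, b))     # weighted sum is -0.05, so ReLU outputs 0.0
```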

The model of the neural network 100 can be referred to as a directed, weighted graph. In a directed graph, each connection to or from a node indicates a direction (e.g., into the node or away from the node). In a weighted graph, each connection can have a weight. Tools for developing neural networks can visualize the neural network as a directed, weighted graph, for ease of understanding and debuggability. In some cases, these tools can also be used to train the neural network and output trained weight values. Executing the neural network is then a matter of using the weights to conduct computations on input data.

Neural networks with many layers can be capable of learning high-level features having more complexity and abstraction than shallower networks. As an example, a neural network can be taught to recognize images. In this example, pixels of an image can be fed into the input layer of the neural network, and the outputs of the first layer can indicate the presence of low-level features in the image, such as lines and edges. At subsequent layers, these features can be combined to measure the likely presence of higher level features: the lines can be combined into shapes, which can be further combined into sets of shapes. Given all this information, the neural network can output a probability that the high-level features represent a particular object or scene. For example, the neural network can output whether an image contains a cat or does not contain a cat.

The learning phase of a neural network is referred to as training the neural network. During training, the neural network 100 is taught to perform a task. In learning the task, values for the weights 110 (and possibly also the biases) are determined. The underlying model architecture 112 for the neural network (e.g., the organization of nodes into layers, the connections between the nodes of each layer, and the computation executed by each node) does not change during training. Once trained, the neural network 100 can perform the task by computing a result using the weights 110 values that were determined during training. Running the program for the neural network is referred to as inference.

As mentioned above, the training process of the neural network 100 can occur across multiple worker nodes of a distributed system, such as a worker node 120 illustrated in FIG. 1. In various implementations, the worker node 120 may be a neural network hardware accelerator, a general purpose hardware processor, or other suitable computing system that supports the arithmetic operations involved in neural network processing as described above. The worker node 120 may include a hardware interface to communicate with other worker nodes via a network. The worker node 120 may include computing resources to perform the operations of a training process, which can include forward propagation operations, loss gradient operations, and backward propagation operations. The worker node 120 may receive training data including training input data and reference output data. Based on the training input data, the worker node 120 may compute output data using the model architecture 112 and the weights 110. The worker node 120 may then compute error data by comparing the output data and the reference output data, which may be used to compute a set of gradients. The gradients may be distributed to other nodes in the distributed system for gradient synchronization. The worker node 120 may then receive synchronized gradients and/or weight adjustments that may be applied to the weights 110.

Because the worker node 120 is operating on different training data from other worker nodes in the distributed system (e.g., different portions of a training data set), the amount of error through an iteration of the training process can vary among the different worker nodes. To improve the accuracy of the neural network model across the different training data, the local gradients calculated by each worker node can be accumulated and then averaged to derive a set of averaged gradients. For example, if the neural network model utilizes twenty weight values, a first iteration of the training process at each worker node will produce twenty local gradients. The first local gradient from each worker node can be added together and divided by the number of worker nodes to derive an averaged gradient for the first value. The calculation can be performed for each of the twenty gradients to derive a set of twenty averaged gradients.
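The averaging described in this example can be sketched as follows; the four worker gradients and five weight values are hypothetical numbers chosen only to illustrate the element-wise average.

```python
import numpy as np

def average_gradients(local_gradients):
    """Element-wise average of the local gradients produced by each worker node."""
    stacked = np.stack(local_gradients)       # shape: (num_workers, num_weights)
    return stacked.sum(axis=0) / len(local_gradients)

# Four worker nodes, each producing gradients for five weight values.
worker_grads = [
    np.array([0.4, -0.2, 0.1, 0.0, 0.3]),
    np.array([0.2, -0.1, 0.2, 0.1, 0.1]),
    np.array([0.6, -0.3, 0.0, 0.1, 0.2]),
    np.array([0.0,  0.2, 0.1, 0.2, 0.2]),
]
print(average_gradients(worker_grads))  # approximately [0.3 -0.1 0.1 0.1 0.2]
```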

FIG. 2 illustrates an example of a training process 200 to train a neural network, such as the neural network 100. As shown in FIG. 2, a forward propagation operation can be performed for each neural network layer, such as a forward propagation operation 202 a for the lowest layer 1 (which can correspond to layer 104 of FIG. 1), a forward propagation operation 202 b for layer 2 (which can correspond to layer 106-1 of FIG. 1), a forward propagation operation 202 n for the highest layer n (which can correspond to layer 108 of FIG. 1), etc. A forward propagation operation at a neural network layer can include the multiplication and summation computations between input data and a set of weights for that layer, followed by activation function processing to generate output data. The output data can then propagate to the next neural network layer as input to the forward propagation operation at that layer. For example, as shown in FIG. 2, forward propagation operation 202 a can combine training input data with w1 weights of layer 1 to generate output data out1, which propagates to layer 2 as input. Forward propagation operation 202 b can combine data out1 with w2 weights of layer 2 to generate output data out2, which can then propagate to the next layer. At the highest layer n, forward propagation operation 202 n receives data outn−1 from layer n−1 (not shown in FIG. 2), combines it with wn weights of layer n, and generates output data outn.

A loss gradient operation 204 can compare the output data outn of layer n against reference output data refoutn to generate input data gradients din. The input data gradients din can measure a rate of difference between outn and refoutn with respect to each data element of output data outn. In some examples, an objective of the training is to minimize the difference between outn and refoutn such that the input data gradients din become close to zero.

Following the generation of input data gradients din by loss gradient operation 204, a backward propagation operation 206 can be performed for each neural network layer. For example, a backward propagation operation 206 n can be performed at the highest layer n, a backward propagation operation 206 b can be performed at layer 2, and a backward propagation operation 206 a can be performed at layer 1. A backward propagation operation at a neural network layer can be based on the weights of that neural network layer, the data gradient input to that neural network layer, as well as the input to the forward propagation operation of that layer. For example, for layer n, backward propagation operation 206 n can receive, as inputs, weights wn, input data outn−1 (from the forward propagation operation at neural network layer n−1), and input data gradient din. The backward propagation operation can perform multiplication and summation computations on the input to generate output data gradients (dn−1, d2, d1, etc. in FIG. 2) and weight gradients wgrad (dwn, dw2, dw1, etc. in FIG. 2). The output data gradients can be forwarded to the next lower neural network layer as inputs to the backward propagation operation in that layer, whereas the weight gradients can represent changes to be applied to weights at a neural network layer.

The weights at layer n can be updated by an update operation 208 (e.g., update operation 208 n for layer n) based on the weight gradients dwn according to the following equation:

wn′=wn−α×dwn

In this equation, wn′ can refer to the updated weights wn, whereas α can include a set of predetermined constants.

The output data gradients dn−1 generated by layer n can then propagate to the next lower neural network layer n−1 as input to the backward propagation operation at that layer. Backward propagation operation 206 b of layer 2 can operate on data gradients d2, weights w2, and input data out1 to generate output data gradients d1 as well as weight gradients dw2. Weight gradients dw2 can be used by update operation 208 b to update the w2 weights. Data gradients d1 can propagate to layer 1. Backward propagation operation 206 a of layer 1 can operate on data gradients d1, weights w1, and the training input data to generate weight gradients dw1. Weight gradients dw1 can be used by update operation 208 a to update the w1 weights.
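The forward propagation, loss gradient, backward propagation, and update operations of FIG. 2 can be sketched for a two-layer case as follows; the layer sizes, the identity activations, the squared-error loss, and the constant α are assumptions made only to keep the sketch short, not the specific operations of the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
x     = rng.normal(size=(4, 8))         # training input data (batch of 4)
ref   = rng.normal(size=(4, 2))         # reference output data
w1    = rng.normal(size=(8, 16)) * 0.1  # layer 1 weights
w2    = rng.normal(size=(16, 2)) * 0.1  # layer 2 (highest layer) weights
alpha = 0.01                            # predetermined constant for the update

# Forward propagation operations (identity activations for brevity).
out1 = x @ w1
out2 = out1 @ w2

# Loss gradient operation: input data gradients from out2 vs. the reference.
d2 = 2.0 * (out2 - ref) / ref.size

# Backward propagation operations: weight gradients and output data gradients.
dw2 = out1.T @ d2     # weight gradients for layer 2
d1  = d2 @ w2.T       # data gradients propagated down to layer 1
dw1 = x.T @ d1        # weight gradients for layer 1

# Update operations: wn' = wn - alpha * dwn.
w2 -= alpha * dw2
w1 -= alpha * dw1
```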

A training process performed on a single computing system can be very time-consuming due to the sequential nature of the training process. Specifically, as described above, in a training process a forward propagation is first performed at each neural network layer to compute training output data, and then a backward propagation is performed at each neural network layer to compute the weight gradients, which is then followed by the updating of the weights at each neural network layer. As the backward propagation operations depend on the forward propagation operations, the two sets of operations may not be performed in parallel. Moreover, due to data dependency among the neural network layers, the forward propagation operations and the backward propagation operations also need to be performed sequentially for each neural network layer. The lack of parallelism can drastically increase the training time, which is further increased when multiple batches of the training process are performed for different portions of the training data, and the batches are repeated in multiple iterations to converge towards minimum data gradients.

As described above, one way to accelerate a training process is by using a distributed system to distribute the training process across multiple computing devices, each of which can be configured as a worker node. Distributing the training process across multiple worker nodes can reduce the amount of training data to be processed at each worker node, which can reduce the time of completion of the forward and backward propagation operations and accelerate the training process. For example, as the volume of training data processed by each worker node has been reduced, the durations of the forward propagation operation and backward propagation operation can be shorter.

FIG. 3 illustrates an example distributed system 300 that can perform a training process for a neural network, according to some implementations. As shown in FIG. 3, the distributed system 300 may include a number of worker nodes (e.g., computing devices) 120-1, 120-2, . . . to 120-n, etc. Each worker node 120 can include a communication interface to communicate with the other worker nodes via a computer network 306. Each worker node 120 can include computing resources to perform the operations of a training process including forward propagation operations, backward propagation operations, update weights operations, etc. The computing resources may include, for example, a neural network processor, a neural network accelerator, a graphics processing unit (GPU), a field programmable gate array (FPGA), a processor or co-processor, an application specific integrated circuit (ASIC), and/or other suitable computing circuitry that supports the arithmetic operations involved in the training process. Each worker node 120 can communicate, via the computer network 306, with other worker nodes 120 to exchange weight gradients to perform exchange operations, and perform update weights operations after the exchange operations are completed.

The distributed system 300 may be initialized by an orchestrating agent 304. In one example, the orchestrating agent 304 may receive a list of jobs 302 that are to be performed. The orchestrating agent 304 may determine which of the worker nodes 120 are available to work on the list of jobs 302, select a number of the worker nodes 120 to work on the list of jobs 302, and provide instructions to each of the selected worker nodes 120 for the list of jobs 302 to be completed. The instructions provided to each of the selected worker nodes 120 may include the identity of other selected worker nodes and/or the identity of the next or previous worker node in a logical ring (for a logical ring topology). Upon completion of the list of jobs 302, the selected worker nodes 120 may alert the orchestrating agent 304 that they are available to work on any subsequent jobs.

FIGS. 4A-4C illustrate various example training steps performed by a distributed system for a first synchronization scheme in which gradients are synchronized at a single worker node. As the gradients are computed by each of the worker nodes 120-1, 120-2, and 120-3, the gradients are distributed to the worker node 120-4, as shown in FIG. 4A. Next, gradient synchronization is performed by the worker node 120-4, as shown in FIG. 4B. After the gradients are synchronized, they are distributed from the worker node 120-4 to each of the worker nodes 120-1, 120-2, and 120-3, as shown in FIG. 4C.

FIGS. 5A-5C illustrate various example training steps performed by a distributed system for a second synchronization scheme in which gradients are synchronized at each worker node. As the gradients are computed by each of the worker nodes 120-1, 120-2, 120-3, and 120-4, the gradients are exchanged throughout such that each worker node receives gradients from each other worker node, as shown in FIG. 5A. Next, gradient synchronization is performed by each of the worker nodes 120-1, 120-2, 120-3, and 120-4, as shown in FIG. 5B. Since each of the worker nodes 120-1, 120-2, 120-3, and 120-4 computes its own synchronized gradients, there is no distribution of synchronized gradients, as shown in FIG. 5C.

FIGS. 6A and 6B illustrate example timing diagrams corresponding to the first and second synchronization schemes illustrated in FIGS. 4A-4C and FIGS. 5A-5C, respectively. In the example shown in FIG. 6A, during a first training epoch, training data is loaded onto each of the worker nodes 120 (optionally worker node 120-4), gradients are computed by each of the worker nodes 120 (optionally worker node 120-4), the gradients are transmitted from the worker nodes 120-1, 120-2, and 120-3 to the worker node 120-4 (as indicated by the downward arrows), which synchronizes the received gradients (along with, optionally, the gradients computed by the worker node 120-4), and then the synchronized gradients are transmitted from the worker node 120-4 to the worker nodes 120-1, 120-2, and 120-3 (as indicated by the upward arrows). The weights associated with the neural network model are updated based on the synchronized gradients prior to a second training epoch, in which the same process is repeated.

In the example shown in FIG. 6B, during a first training epoch, training data is loaded onto each of the worker nodes 120, gradients are computed by each of the worker nodes 120, the gradients are transmitted (e.g., exchanged) between each of the worker nodes 120 (as indicated by the downward and upward arrows), and each of the worker nodes 120 synchronizes the received gradients (along with the gradients computed at the particular worker node). The weights associated with the neural network model are updated based on the synchronized gradients prior to a second training epoch, in which the same process is repeated.

FIG. 7 illustrates an example timing diagram for a neural network model training. In the example shown in FIG. 7, during a first training epoch, training data loaded onto the worker node 120-1 is used for training of a first neural network model having layers 1, 2, and 3. Next, gradients are computed by the worker node 120-1 by performing forward propagation operations for layers 1, 2, and 3, followed by backward propagation operations for layers 3, 2, and 1. As the gradients for each layer are computed and become available, they are transmitted from the worker node 120-1 to the worker node 120-4 (as indicated by the downward arrows). For example, the gradients for layer 3 are computed first and are then transmitted to the worker node 120-4, the gradients for layer 2 are computed next and are then transmitted to the worker node 120-4, and the gradients for layer 1 are computed last and are then transmitted to the worker node 120-4.

The worker node 120-4 synchronizes the gradients received from the worker node 120-1 with other received gradients as they are received. For example, the gradients for layer 3 are received first and begin to be synchronized before those of layers 1 and 2, the gradients for layer 2 are received next and begin to be synchronized after layer 3 but before layer 1, and the gradients for layer 1 are received last and begin to be synchronized after layers 2 and 3. The synchronized gradients are transmitted from the worker node 120-4 to the worker node 120-1 (as indicated by the upward arrow). The weights associated with the first neural network model are updated based on the synchronized gradients prior to a second training epoch, in which the same process is repeated.

FIG. 8 illustrates an example communication of a set of gradients between a transmitting worker node 802 and a receiving worker node 804. After uncompressed gradients 806 are computed by the transmitting worker node 802, but prior to transmission, the uncompressed gradients 806 are compressed by a compression module 812 at the transmitting worker node 802 to generate compressed gradients 808. The compressed gradients 808 are then transmitted from the transmitting worker node 802 to the receiving worker node 804. Upon receiving the compressed gradients 808, the compressed gradients 808 are decompressed by a decompression module 814 at the receiving worker node 804, resulting in decompressed gradients 810.

FIG. 9 illustrates an example diagram showing steps for exchanging compressed gradient data within a distributed system. A set of gradients 902 are computed at a transmitting worker node using a neural network model, a set of weights, and training data. The set of gradients 902 includes data elements N0, N1, . . . , N7. A sparsity analysis 904 is performed on the set of gradients 902 to determine a clipping threshold T. Clipping is performed on the set of gradients 902 by clipping each of the set of gradients 902 having a value less than the clipping threshold T.

In the illustrated example, data elements N0, N4, and N7 are each determined to have a value greater than the clipping threshold T, and data elements N1, N2, N3, N5, and N6 are each determined to have a value less than the clipping threshold T. Accordingly, data elements N1, N2, N3, N5, and N6 are clipped by performing gradient clipping 906, resulting in a set of gradients 908 including non-clipped data elements N0, N4, and N7 and clipped data elements equal to zero at the locations of data elements N1, N2, N3, N5, and N6. In some examples, clipped data elements may have a value other than zero, such as some predetermined value greater than zero.

A gradient compression 910 is performed on the set of gradients 908 to generate a header 912 and compressed data 918. The header may include a mapping 914 and an original length 916. The mapping 914 may indicate which of the set of gradients 908 correspond to non-clipped data elements and/or which of the set of gradients 908 correspond to clipped data elements. In the illustrated example, the mapping 914 includes an 8-bit bitmap with 1's at the locations of non-clipped data elements and 0's at the locations of clipped data elements. In other examples, the 1's and 0's may be flipped. In some examples, the mapping 914 may include a 2D bitmap or matrix containing binary values (e.g., 1's or 0's) indicating the locations of non-clipped data elements. The mapping 914 may include the original length 916 indicating the number of non-clipped data elements and clipped data elements in the set of gradients 908. In the illustrated example, the original length 916 is equal to 8 since there are 3 non-clipped data elements and 5 clipped data elements.

The compressed data 918 may include the non-clipped data elements (and/or a representation of the non-clipped data elements) from the set of gradients 908: N0, N4, and N7. The header 912 and the compressed data 918 are transmitted by the transmitting worker node to the receiving worker node. When they are received, a gradient decompression 920 may be performed on the header 912 and the compressed data 918 to generate decompressed data comprising a set of gradients 922. During the gradient decompression 920, the set of gradients 922 are formed by positioning the non-clipped data elements contained in (or represented by) the compressed data 918 at the locations of the non-clipped data elements identified by the mapping 914. Accordingly, the set of gradients 922 may exactly match the set of gradients 908.
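The gradient decompression 920 can be sketched as follows, continuing the earlier clipping sketch; the bitmap layout and the restoration of clipped data elements as zeros mirror the illustrated example, while the concrete values and function name are assumptions.

```python
import numpy as np

def decompress(mapping: np.ndarray, compressed: np.ndarray) -> np.ndarray:
    """Rebuild the full set of gradients from the bitmap and non-clipped values."""
    restored = np.zeros(mapping.size, dtype=compressed.dtype)  # clipped -> zero
    restored[mapping.astype(bool)] = compressed                # place survivors
    return restored

# Bitmap matching the FIG. 9 layout (N0, N4, and N7 are non-clipped).
mapping    = np.array([1, 0, 0, 0, 1, 0, 0, 1], dtype=np.uint8)
compressed = np.array([0.9, 0.8, 0.7], dtype=np.float16)
print(decompress(mapping, compressed))  # [0.9 0. 0. 0. 0.8 0. 0. 0.7]
```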

FIG. 10 illustrates an example timing diagram for transmitting compressed gradient data within a distributed system. In the example shown in FIG. 10, training data is loaded onto the worker node 120-1 for training of a neural network model having layers 1, 2, and 3. Gradients for each of the layers are computed by the worker node 120-1 by performing forward propagation operations for layers 1, 2, and 3, followed by backward propagation operations for layers 3, 2, and 1. As the gradients for each layer are computed and become available, they are compressed by performing sparsity analysis 904, gradient clipping 906, and gradient compression 910, as described in reference to FIG. 9. For example, the gradients for layer 3 are computed first and are then clipped, compressed, and transmitted to the worker node 120-4, the gradients for layer 2 are computed next and are then clipped, compressed, and transmitted to the worker node 120-4, and the gradients for layer 1 are computed last and are then clipped, compressed, and transmitted to the worker node 120-4.

In some embodiments, a DMA controller 1002 at the worker node 120-1 may perform one or more of the compression tasks. For example, the DMA controller 1002 may utilize a gradient compression engine (GCE) to perform sparsity analysis, gradient clipping, quantization, and/or compression. The GCE may build the compression header to record the original length (the number of non-clipped and clipped data elements) and the compression bitmap.

The worker node 120-4 decompresses the compressed gradients received from the worker node 120-1 and then synchronizes the gradients with gradients received from other worker nodes (e.g., the worker nodes 120-2 and 120-3). For example, after the gradients for layer 3 are received, they are decompressed and begin to be synchronized before those of layers 1 and 2; after the gradients for layer 2 are received, they are decompressed and begin to be synchronized after layer 3 but before layer 1; and after the gradients for layer 1 are received, they are decompressed and begin to be synchronized after layers 2 and 3. The synchronized gradients are transmitted from the worker node 120-4 to the worker node 120-1 (as indicated by the upward arrow). The weights associated with the neural network model are updated based on the synchronized gradients. In some embodiments, the same compression/decompression operations may be performed with the synchronized gradients before and after transmission of the set of synchronized gradients from the worker node 120-4 to the first worker node 120-1, as shown in FIG. 10.

In some embodiments, a DMA controller 1004 at the worker node 120-4 may perform one or more of the decompression (and/or compression) tasks. For example, the DMA controller 1004 may utilize a gradient decompression engine to read the compressed data and the header, insert zero data elements between non-clipped (e.g., non-zero) data elements, and then write out the decompressed data elements.

FIGS. 11A and 11B illustrate a method 1100 of exchanging compressed gradient data within a distributed system. One or more steps of the method 1100 may be omitted during performance of the method 1100, and steps of the method 1100 need not be performed in the order shown. One or more steps of the method 1100 may be performed by one or more processors, such as a neural network processor or a component therein (e.g., a compression or decompression accelerator). The method 1100 may be implemented as a computer-readable medium or computer program product comprising instructions which, when the program is executed by one or more computers, cause the one or more computers to carry out the steps of the method 1100. Such computer program products can be transmitted, over a wired or wireless network, in a data carrier signal carrying the computer program product.

At step 1102, a neural network model is received at a first worker node of a distributed system. Further at step 1102, a set of weights associated with the neural network model may be received at the first worker node. The neural network model may have a particular architecture with a particular number of layers.

At step 1104, training data is received at the first worker node. The training data may include training input data and corresponding reference output data. The training data may be used to train the neural network model.

At step 1106, a set of gradients are computed at the first worker node. The set of gradients may be computed using the neural network model, the set of weights, and the training data. The set of gradients may include gradients for a first layer of the neural network model, gradients for a second layer of the neural network model, and gradients for a third layer of the neural network model. The gradients for the third layer may be computed first, the gradients for the second layer may be computed next, and the gradients for the first layer may be computed last.

At step 1108, a sparsity analysis is performed on the set of gradients. Performing the sparsity analysis may include determining statistics associated with the gradients. For example, an average/mean (or median) and a standard deviation of the set of gradients may be calculated. In some embodiments, different standard deviations from the mean may be calculated to determine different cumulative levels. In some embodiments, the sparsity analysis may determine different sparse percentages. For example, a 50% sparse percentage may correspond to the value at which, if a clipping threshold were applied at the value, 50% of the gradients would be set to zero. Similarly, a 90% sparse percentage may correspond to the value at which, if a clipping threshold were applied, 90% of the gradients would be set to zero. In various implementations, the sparsity analysis may determine 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95% sparse percentages, and/or any percentages there between. In some embodiments, the sparsity analysis may be performed at the first worker node (e.g., by a DMA controller at the first worker node).
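One way to turn a target sparse percentage into a concrete clipping threshold T is to take the corresponding percentile of the gradient values, as sketched below; the percentile-based computation is an illustrative assumption, not necessarily the statistic used in a given embodiment.

```python
import numpy as np

def clipping_threshold(gradients: np.ndarray, sparse_percentage: float) -> float:
    """Return T such that roughly `sparse_percentage` percent of the gradients
    lie below T and would therefore be clipped (set to zero)."""
    return float(np.percentile(gradients, sparse_percentage))

grads = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
t50 = clipping_threshold(grads, 50.0)  # about 50% of the gradients fall below t50
t90 = clipping_threshold(grads, 90.0)  # about 90% of the gradients fall below t90
print(t50, t90)
```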

At step 1110, a clipping threshold T is determined based on the sparsity analysis. The clipping threshold T may be set to the value associated with the sparse percentage found in step 1108. For example, the clipping threshold T may be set to the 50% sparse percentage or any sparse percentage listed in step 1108. In some embodiments, the clipping threshold T may be determined based on the sparsity analysis for a previous layer of the neural network model. For example, in some embodiments, the sparsity analysis may be performed for a third layer of the neural network and the clipping threshold T for each of the third layer, second layer, and first layer may be determined based on the same sparsity analysis, thereby increasing compression speed.

In some embodiments, the clipping threshold T may be set to different sparse percentages for different layers of the neural network. In some instances, the sparse percentage may be greater for layers with more gradients and less for layers with fewer gradients to better align the compression times and/or transmission times for all layers. In one example, a 40% sparse percentage may be used for the third layer of the neural network model, a 60% sparse percentage may be used for the second layer of the neural network model (the second layer having more gradients than the third layer), and an 80% sparse percentage may be used for the first layer of the neural network model (the first layer having more gradients than the second layer). In some embodiments, the clipping threshold T may be determined at the first worker node (e.g., by a DMA controller at the first worker node).

At step 1112, gradient clipping is performed on the set of gradients using the clipping threshold T by clipping each of the set of gradients having a value less than the clipping threshold T. Clipping a gradient may cause its value to be set to zero or to some predetermined value (such as the clipping threshold T). In some embodiments, performing gradient clipping on the set of gradients results in the set of gradients including non-clipped data elements and clipped data elements. In some embodiments, gradient clipping may be performed at the first worker node (e.g., by a DMA controller at the first worker node).
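Step 1112 can be rendered in a single NumPy expression, as in the following sketch; the threshold and gradient values are arbitrary illustrative numbers.

```python
import numpy as np

# Clip each gradient whose value is less than the clipping threshold T by
# setting it to zero (a predetermined non-zero value could be used instead).
T = 0.5
grads = np.array([0.9, 0.1, 0.2, 0.05, 0.8, 0.15, 0.1, 0.7], dtype=np.float16)
clipped_grads = np.where(grads < T, np.float16(0.0), grads)
print(clipped_grads)  # [0.9 0.  0.  0.  0.8 0.  0.  0.7]
```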

At step 1114, a mapping is generated that indicates which of the set of gradients correspond to non-clipped data elements and which of the set of gradients correspond to clipped data elements. The mapping may include a bitmap with binary values indicating the locations of the non-clipped data elements and the clipped data elements. For example, the bitmap may include 1's corresponding to the locations of non-clipped data elements and 0's corresponding to the locations of clipped data elements. In some embodiments, the mapping may be generated at the first worker node (e.g., by a DMA controller at the first worker node).

At step 1116, a header is formed that includes the mapping. In some embodiments, the method 1100 may further include generating or determining an original length of the set of gradients. In such embodiments, the header may further include the original length. In some embodiments, the header may be formed at the first worker node (e.g., by a DMA controller at the first worker node).

At step 1118, compressed data that includes (or represents) the non-clipped data elements may be generated. In some embodiments, the compressed data may be generated at the first worker node (e.g., by a DMA controller at the first worker node).

At step 1120, the header and the compressed data are transmitted from the first worker node to a second worker node of the distributed system. In some embodiments, the header and the compressed data may be transmitted for each layer of the neural network model in the order in which the gradients for each layer are computed. For example, the header and the compressed data for the third layer may be transmitted first, the header and the compressed data for the second layer may be transmitted next, and the header and the compressed data for the first layer may be transmitted last.

At step 1122, decompressed data is generated by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping. Once generated, the decompressed data may include the set of gradients comprising the non-clipped data elements and the clipped data elements. In some embodiments, the decompressed data may be generated at the second worker node (e.g., by a DMA controller at the second worker node).

At step 1124, a set of synchronized gradients are computed at the second worker node based on the set of gradients from the decompressed data and other received gradients. The set of synchronized gradients may be computed as each of the set of gradients is received. For example, synchronized gradients for the third layer of the first neural network model may be computed first, synchronized gradients for the second layer of the first neural network model may be computed next, and synchronized gradients for the first layer of the first neural network model may be computed last.

In some embodiments, the method 1100 may further include transmitting the set of synchronized gradients from the second worker node to the first worker node and adjusting the set of weights based on the set of synchronized gradients. In some embodiments, the same compression/decompression operations may be performed with the synchronized gradients before and after transmission of the set of synchronized gradients from the second worker node to the first worker node.

The performance of the method 1100 was compared to the compressed row storage (CRS) algorithm in an experimental setting to demonstrate its feasibility. The compression rate CR, which measures the fractional reduction in size from the original uncompressed data to the compressed data, was calculated for the method 1100 as

CR = 1 − (NNZ×DS×8 + m×n + log₂(m×n)) / (m×n×DS×8),

where NNZ is the number of non-zero elements, m is the number of columns in the original uncompressed data matrix, n is the number of rows in the original uncompressed data matrix (m×n being the total number of elements in the data), and DS is the element data size.

As shown in the table below, compared to the CRS algorithm, the method 1100 had a better CR in both high sparse percentage matrices and low sparse percentage matrices. The CRS algorithm provided a negative compression rate (i.e., the compressed data is larger than the original data) for sparse percentages of 50% and below. In some low sparse percentage cases, such as 5%, the CRS algorithm can even double the original data size. The method 1100 provides a high CR of 87% at a sparse percentage of 90% (where 90% of the elements are zero elements) as well as a CR above 0% at a very low sparse percentage of 5% (where only 5% of the elements are zero elements).

For a 32×128 matrix with FP16 data type (m=32 and n=128) and an original data size of 8192, the following experimental results were obtained:

Sparse Percentage   Compression Method   Header       NNZ      CR
90%                 Method 1100          256.451545   819.2    0.86869488
90%                 CRS Algorithm        885.2        819.2    0.791943359
75%                 Method 1100          256.451545   2048     0.71869488
75%                 CRS Algorithm        2114         2048     0.491943359
50%                 Method 1100          256.451545   4096     0.46869488
50%                 CRS Algorithm        4162         4096     −0.008056641
5%                  Method 1100          256.451545   7782.4   0.01869488
5%                  CRS Algorithm        7848.4       7782.4   −0.908056641
0%                  Method 1100          256.451545   8192     −0.03130512
0%                  CRS Algorithm        8258         8192     −1.008056641

It was determined that the CRS algorithm outperforms the method 1100 if the sparse percentage is greater than 96.875%. For all other sparse percentages, the method 1100 outperforms the CRS algorithm.

FIG. 12 illustrates an example of an accelerator 1202 that may be an integrated circuit component of a processor, such as a neural network processor. The processor may have other integrated circuit components, including additional accelerator engines. In various examples, the accelerator 1202, for a set of input data (e.g., input data 1250), can execute computations using a processing engine array 1210, an activation engine 1216, and/or a pooling engine 1218.

In various implementations, the memory subsystem 1204 can include multiple memory banks 1214. In these implementations, each memory bank 1214 can be independently accessible, meaning that the read of one memory bank is not dependent on the read of another memory bank. Similarly, writing to one memory bank does not affect or limit writing to a different memory bank. In some cases, each memory bank can be read and written at the same time. Various techniques can be used to have independently accessible memory banks 1214. For example, each memory bank can be a physically separate memory component that has an address space that is separate and independent of the address spaces of each other memory bank. In this example, each memory bank may have at least one read channel and may have at least one separate write channel that can be used at the same time. In these examples, the memory subsystem 1204 can permit simultaneous access to the read or write channels of multiple memory banks. As another example, the memory subsystem 1204 can include arbitration logic such that arbitration between, for example, the outputs of multiple memory banks 1214 can result in more than one memory bank's output being used. In these and other examples, though globally managed by the memory subsystem 1204, each memory bank can be operated independently of any other.

Having the memory banks 1214 be independently accessible can increase the efficiency of the accelerator 1202. For example, values can be simultaneously read and provided to each row of the processing engine array 1210, so that the entire processing engine array 1210 can be in use in one clock cycle. As another example, the memory banks 1214 can be read at the same time that results computed by the processing engine array 1210 are written to the memory subsystem 1204. In contrast, a single memory may be able to service only one read or write at a time. With a single memory, multiple clock cycles can be required, for example, to read input data for each row of the processing engine array 1210 before the processing engine array 1210 can be started.

In various implementations, the memory subsystem 1204 can be configured to simultaneously service multiple clients, including the processing engine array 1210, the activation engine 1216, the pooling engine 1218, and any external clients that access the memory subsystem 1204 over a communication fabric 1220. In some implementations, being able to service multiple clients can mean that the memory subsystem 1204 has at least as many memory banks as there are clients. In some cases, each row of the processing engine array 1210 can count as a separate client. In some cases, each column of the processing engine array 1210 can output a result, such that each column can count as a separate write client. In some cases, output from the processing engine array 1210 can be written into the memory banks 1214 that can then subsequently provide input data for the processing engine array 1210. As another example, the activation engine 1216 and the pooling engine 1218 can include multiple execution channels, each of which can be separate memory clients. The memory banks 1214 can be implemented, for example, using static random access memory (SRAM).

In various implementations, the memory subsystem 1204 can include control logic. The control logic can, for example, keep track of the address spaces of each of the memory banks 1214, identify memory banks 1214 to read from or write to, and/or move data between the memory banks 1214. In some implementations, memory banks 1214 can be hardwired to particular clients. For example, a set of memory banks 1214 can be hardwired to provide values to the rows of the processing engine array 1210, with one memory bank servicing each row. As another example, a set of memory banks can be hardwired to receive values from columns of the processing engine array 1210, with one memory bank receiving data for each column.

The processing engine array 1210 is the computation matrix of the example accelerator 1202. The processing engine array 1210 can, for example, execute parallel integration, convolution, correlation, and/or matrix multiplication, among other things. The processing engine array 1210 includes multiple processing engines 1211, arranged in rows and columns, such that results output by one processing engine 1211 can be input directly into another processing engine 1211. Processing engines 1211 that are not on the outside edges of the processing engine array 1210 thus can receive data to operate on from other processing engines 1211, rather than from the memory subsystem 1204.

In various examples, the processing engine array 1210 uses systolic execution, in which data arrives at each processing engine 1211 from different directions at regular intervals. In some examples, input data can flow into the processing engine array 1210 from the left and weight values can be loaded at the top. In some examples, weights and input data can flow from the left and partial sums can flow from top to bottom. In these and other examples, a multiply-and-accumulate operation moves through the processing engine array 1210 as a diagonal wave front, with data moving to the right and down across the array. Control signals can be input at the left at the same time as weights, and can flow across and down along with the computation.

In various implementations, the number of columns in the processing engine array 1210 determines the computational capacity of the processing engine array 1210, and the number of rows determines the required memory bandwidth for achieving maximum utilization of the processing engine array 1210. The processing engine array 1210 can have, for example, 64 columns and 428 rows, or some other number of columns and rows.
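
As a rough worked example of these two quantities (assuming one multiply-accumulate per processing engine per clock cycle, a rate the text implies but does not state):

    # Back-of-envelope math for a 64-column by 428-row processing engine array,
    # assuming one multiply-accumulate (MAC) per engine per clock cycle.
    rows, cols = 428, 64
    peak_macs_per_cycle = rows * cols  # capacity: 27,392 MACs per cycle
    row_inputs_per_cycle = rows        # bandwidth: one input value per row per cycle
    print(peak_macs_per_cycle, row_inputs_per_cycle)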

An example of a processing engine 1211 is illustrated in FIG. 12 in an inset diagram. As illustrated by this example, a processing engine 1211 can include a multiplier-accumulator circuit. Inputs from the left can include, for example, input data i and a weight value w, where the input data is a value taken from either a set of input data or a set of intermediate results, and the weight value is from a set of weight values that connect one layer of the neural network to the next. A set of input data can be, for example, an image being submitted for identification or object recognition, an audio clip being provided for speech recognition, a string of text for natural language processing or machine translation, or the current state of a game requiring analysis to determine a next move, among other things. In some examples, the input data and the weight value are output to the right, for input to the next processing engine 1211.

In the illustrated example, an input from above can include a partial sum, p_in, provided either from another processing engine 1211 or from a previous round of computation by the processing engine array 1210. When starting a computation for a new set of input data, the top row of the processing engine array 1210 can receive a fixed value for p_in, such as zero. As illustrated by this example, i and w are multiplied together and the result is summed with p_in to produce a new partial sum, p_out, which can be input into another processing engine 1211. Various other implementations of the processing engine 1211 are possible.
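
The per-engine arithmetic and the downward flow of partial sums can be sketched in a few lines of Python. This is a toy model of the dataflow under a weight-stationary schedule, not a description of the hardware; the sizes are illustrative.

    # Toy model of one systolic column: p_out = p_in + i * w at each engine,
    # with the top row receiving a fixed partial sum of zero.
    def pe(i: float, w: float, p_in: float) -> float:
        """One processing engine: multiply input by weight, add the partial sum."""
        return p_in + i * w

    def column_pass(inputs, weights):
        """Chain engines down a column; the last row emits the dot product."""
        p = 0.0  # fixed p_in for the top row
        for i, w in zip(inputs, weights):
            p = pe(i, w, p)  # p_out of one engine becomes p_in of the next
        return p

    # A 3-row by 2-column array computes one dot product per column.
    x = [1.0, 2.0, 3.0]
    W = [[0.5, -1.0], [0.25, 0.0], [2.0, 1.0]]
    print([column_pass(x, [row[c] for row in W]) for c in range(2)])  # [7.0, 2.0]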

Outputs from the last row in the processing engine array 1210 can be temporarily stored in the results buffer 1212. The results can be intermediate results, which can be written to the memory banks 1214 to be provided to the processing engine array 1210 for additional computation. Alternatively, the results can be final results, which, once written to the memory banks 1214, can be read from the memory subsystem 1204 over the communication fabric 1220, to be output by the system.

In some implementations, the accelerator 1202 includes an activation engine 1216. In these implementations, the activation engine 1216 can combine the results from the processing engine array 1210 into one or more output activations. For example, for a convolutional neural network, convolutions from multiple channels can be summed to produce an output activation for a single channel. In other examples, accumulating results from one or more columns in the processing engine array 1210 may be needed to produce an output activation for a single node in the neural network. In some examples, the activation engine 1216 can be bypassed.

In various examples, the activation engine 1216 can include multiple separate execution channels. In these examples, the execution channels can correspond to the columns of the processing engine array 1210, and can perform an operation on the outputs of a column, the result of which can be stored in the memory subsystem 1204. In these examples, the activation engine 1216 may be able to perform between 1 and n parallel computations, where n is equal to the number of columns in the processing engine array 1210. In some cases, one or more of the computations can be performed simultaneously. Examples of computations that each execution channel can perform include exponentials, squares, square roots, identities, binary steps, bipolar steps, sigmoidals, and ramps, among other examples.
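
A schematic of such per-column channels, in Python (the operation table mirrors the examples just listed; the channel model itself is an illustration, not the engine's design):

    # Illustrative per-column activation channels; each channel applies the
    # same operation to one column's output.
    import math

    OPS = {
        "identity": lambda x: x,
        "square": lambda x: x * x,
        "sqrt": math.sqrt,
        "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
        "ramp": lambda x: max(0.0, x),  # ReLU-style ramp
    }

    def activate_columns(column_outputs, op="sigmoid"):
        return [OPS[op](v) for v in column_outputs]

    print(activate_columns([7.0, 2.0]))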

In some implementations, the accelerator 1202 can include a pooling engine 1218. Pooling is the combining of outputs of the columns of the processing engine array 1210. Combining can include, for example, computing a maximum value, a minimum value, an average value, a median value, a summation, a multiplication, or another logical or mathematical combination. In various examples, the pooling engine 1218 can include multiple execution channels that can operate on values from corresponding columns of the processing engine array 1210. In these examples, the pooling engine 1218 may be able to perform between 1 and n parallel computations, where n is equal to the number of columns in the processing engine array 1210. In various examples, execution channels of the pooling engine 1218 can operate in parallel and/or simultaneously. In some examples, the pooling engine 1218 can be bypassed.
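
In the same illustrative style, pooling reduces a window of column outputs to a single value (the set of combinations mirrors the list above):

    # Illustrative pooling over column outputs.
    def pool(values, mode="max"):
        ops = {
            "max": max,
            "min": min,
            "sum": sum,
            "avg": lambda v: sum(v) / len(v),
        }
        return ops[mode](values)

    print(pool([7.0, 2.0, 5.0], mode="avg"))  # 4.666...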

Herein, the activation engine 1216 and the pooling engine 1218 may be referred to collectively as execution engines. The processing engine array 1210 is another example of an execution engine. Another example of an execution engine is a Direct Memory Access (DMA) engine, which may be located outside the accelerator 1202.

Input data 1250 can arrive over the communication fabric 1220. The communication fabric 1220 can connect the accelerator 1202 to other components of a processor, such as a DMA engine that can obtain input data 1250 from an Input/Output (I/O) device, a storage drive, or a network interface. The input data 1250 can be, for example, one-dimensional data, such as a character string or numerical sequence, or two-dimensional data, such as an array of pixel values for an image or frequency and amplitude values over time for an audio signal. In some examples, the input data 1250 can be three-dimensional, as may be the case with, for example, the situational information used by a self-driving car or virtual reality data. In some implementations, the memory subsystem 1204 can include a separate buffer for the input data 1250. In some implementations, the input data 1250 can be stored in the memory banks 1214 when the accelerator 1202 receives the input data 1250.

In some examples, the accelerator 1202 can implement a neural network processing engine. In these examples, the accelerator 1202, for a set of input data 1250, can execute a neural network to perform a task for which the neural network was trained. Executing a neural network on a set of input data can be referred to as inference or performing inference.

The weights for the neural network can be stored in the memory subsystem 1204, along with input data 1250 on which the neural network will operate. The neural network can also include instructions, which can program the processing engine array 1210 to perform various computations on the weights and the input data. The instructions can also be stored in the memory subsystem 1204, in the memory banks 1214 or in a separate instruction buffer. The processing engine array 1210 can output intermediate results, which represent the outputs of individual layers of the neural network. In some cases, the activation engine 1216 and/or pooling engine 1218 may be enabled for computations called for by certain layers of the neural network. The accelerator 1202 can store the intermediate results in the memory subsystem 1204 for inputting into the processing engine array 1210 to compute results for the next layer of the neural network. The processing engine array 1210 can further output final results from a last layer of the neural network. The final results can be stored in the memory subsystem 1204 and then be copied out to host processor memory or to another location.

FIG. 13 illustrates an example of an acceleration engine 1300. The acceleration engine 1300 is an example of an integrated circuit that can include one or more accelerators 1302a-1302n that may be similar to the accelerator illustrated in FIG. 12.

In the example of FIG. 13, the acceleration engine 1300 includes multiple accelerators 1302a-1302n, each of which can perform a set of operations. In various examples, the accelerators 1302a-1302n are for particular types of operations, so that the accelerators 1302a-1302n can perform the operations much faster than when similar operations are performed by a general purpose processor. In various examples, to perform a set of operations, input data on which the operations are to be performed must first be moved into the accelerators 1302a-1302n. Additionally, in some cases, program code is also moved into the accelerators 1302a-1302n, which programs the operations that the accelerators 1302a-1302n will perform on the data. In the illustrated example, the acceleration engine 1300 includes n accelerators 1302a-1302n. Examples of accelerators that can be included in the acceleration engine 1300 include graphics accelerators, floating point accelerators, neural network accelerators, and others. In various examples, the accelerators 1302a-1302n can each be the same (e.g., each of them is a graphics accelerator) or can be different (e.g., the accelerators 1302a-1302n include a graphics accelerator, a floating point accelerator, and a neural network accelerator).

The example acceleration engine 1300 further includes DRAM controllers 1342a-1342k for communicating with an external memory. The external memory is implemented, in this example, using DRAM 1330. In the illustrated example, the acceleration engine 1300 includes k DRAM controllers 1342a-1342k, each of which may be able to communicate with an independent set of banks of DRAM. In other examples, other types of RAM technology can be used for the external memory. The DRAM controllers 1342a-1342k can also be referred to as memory controllers.

In various examples, input data and/or program code for the accelerators 1302a-1302n can be stored in the DRAM 1330. Different programs can cause the accelerators 1302a-1302n to perform different operations. For example, when one of the accelerators is a neural network accelerator, one program can configure the neural network accelerator to perform speech recognition while another program can configure the neural network accelerator to perform image recognition. In various examples, different accelerators 1302a-1302n can be programmed with different programs, so that each performs a different set of operations. In various examples, the processors 1348a-1348s can manage moving of program code from the DRAM 1330 to the accelerators 1302a-1302n.

The example acceleration engine 1300 further includes I/O controllers 1344a-1344p for communicating with I/O devices 1332 in the system. The acceleration engine 1300 can communicate with I/O devices over, for example, a processor bus. In some examples, the processor bus can be implemented using Peripheral Component Interconnect (PCI) and/or a variation of the PCI bus protocol. The processor bus can connect the acceleration engine 1300 to I/O devices such as, for example, input and output devices, memory controllers, storage devices, and/or network interface cards, among other things. In some examples, the I/O controllers 1344a-1344p can enable the acceleration engine 1300 to act as an I/O device for a host processor. For example, the acceleration engine 1300 can be the recipient of input data from the host processor, and a command indicating an operation to be performed on the input data (e.g., a particular computation or analysis). In the illustrated example, the acceleration engine 1300 includes p I/O controllers 1344a-1344p, each of which may include a separate root complex and may communicate with a separate set of I/O devices 1332. In other examples, other standardized bus protocols, such as Ultra Path Interconnect (UPI), can be used for the host bus. In other examples, a proprietary bus protocol can be used.

Movement of data in the acceleration engine 1300 can be managed by one or more processors 1348a-1348s, which can also be referred to as data management processors. In the example of FIG. 13, the acceleration engine 1300 includes s processors 1348a-1348s incorporated into the device (e.g., on the same silicon die). In other examples, the processors 1348a-1348s can be external to the acceleration engine 1300 (e.g., on a different die and/or in a different package). In some examples, the processors 1348a-1348s can manage the movement of data from I/O devices 1332 to the accelerators 1302a-1302n or the DRAM 1330. For example, input data may be located at an I/O device 1332 or in processor memory, and the processors 1348a-1348s can move the input from the I/O device 1332 or processor memory into an accelerator or into DRAM 1330. As another example, program code for the accelerators 1302a-1302n may be located on an I/O device 1332 or in processor memory.

The example acceleration engine 1300 further includes DMA engines 1346a-1346d that can move data between the accelerators 1302a-1302n, DRAM controllers 1342a-1342k, and I/O controllers 1344a-1344p. In the illustrated example, the acceleration engine 1300 includes d DMA engines 1346a-1346d. In some implementations, the DMA engines 1346a-1346d can be assigned to specific tasks, such as moving data from the DRAM controllers 1342a-1342k to the accelerators 1302a-1302n, or moving data between the I/O controllers 1344a-1344p and the accelerators 1302a-1302n. These tasks can be assigned, for example, by enqueueing descriptors with the DMA engines 1346a-1346d, where a descriptor identifies an address for a block of data and an operation (e.g., a read or a write) to perform. A descriptor, for example, can direct a DMA engine to instruct a DMA controller to read a block of data from DRAM 1330. A descriptor can, as a further example, instruct the DMA engine to write data, read by the DMA controller, to an accelerator. Further descriptors can be used to move data from an accelerator to DRAM 1330.
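
The descriptor mechanism can be sketched as follows (the field names and the queue discipline are assumptions for illustration; the hardware's actual descriptor format is not specified here):

    # Sketch of descriptor-driven DMA task assignment.
    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Descriptor:
        address: int  # location of the data block
        length: int   # block size in bytes
        op: str       # "read" or "write"

    class DMAEngine:
        def __init__(self):
            self.queue = deque()

        def enqueue(self, desc: Descriptor):
            """Tasks are assigned by enqueueing descriptors with the engine."""
            self.queue.append(desc)

        def run(self):
            while self.queue:
                d = self.queue.popleft()
                print(f"{d.op} {d.length} bytes at 0x{d.address:08x}")

    # Read a block from DRAM, then write it to an accelerator's buffer.
    dma = DMAEngine()
    dma.enqueue(Descriptor(address=0x80000000, length=4096, op="read"))
    dma.enqueue(Descriptor(address=0x00100000, length=4096, op="write"))
    dma.run()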

In various examples, each of the processors 1348a-1348s can be responsible for managing the data movement for a different accelerator. In some examples, a processor may manage the data movement for more than one accelerator. Similarly, in various examples, each of the processors 1348a-1348s can be assigned to one or more DMA engines 1346a-1346d. In these and other examples, associations between processors 1348a-1348s, accelerators 1302a-1302n, and DMA engines 1346a-1346d are determined by program code being executed by each respective processor.

In the example acceleration engine 1300, the various components can communicate over a chip interconnect 1320. The chip interconnect 1320 primarily includes wiring for routing data between the components of the acceleration engine 1300. In some cases, the chip interconnect 1320 can include a minimal amount of logic, such as multiplexors to control the direction of data, flip-flops for handling clock domain crossings, and timing logic.

FIG. 14 illustrates an example of a host system 1400 in which an acceleration engine 1460 can be used. The acceleration engine 1460 of FIG. 14 is an example of a device that can include one or more accelerators such as is illustrated in FIG. 13. The example host system 1400 of FIG. 14 includes the acceleration engine 1460, a host processor 1472, DRAM 1430 or processor memory, I/O devices 1432, and support systems 1474. In various implementations, the host system 1400 can include other hardware that is not illustrated here.

The host processor 1472 is a general purpose integrated circuit that is capable of executing program instructions. In some examples, the host processor 1472 can include multiple processing cores. A multi-core processor may include multiple processing units within the same processor. In some examples, the host system 1400 can include more than one host processor 1472. In some examples, the host processor 1472 and the acceleration engine 1460 can be one chip, such as one or more integrated circuits within the same package.

In various examples, the host processor 1472 can communicate with other components in the host system 1400 over one or more communication channels. For example, the host system 1400 can include a host processor bus, which the host processor 1472 can use to communicate with the DRAM 1430, for example. As another example, the host system 1400 can include an I/O bus, such as a PCI-based bus, over which the host processor 1472 can communicate with the acceleration engine 1460 and/or the I/O devices 1432, for example. In various examples, the host system 1400 can, alternatively or additionally, include other communication channels or busses, such as serial busses, power management busses, storage device busses, and so on.

In some examples, software programs executing on the host processor 1472 can receive or generate input for processing by the acceleration engine 1460. In some examples, the programs can select an appropriate neural network to execute for a given input. For example, a program may be for language translation, and can select one or more neural networks capable of speech recognition and/or machine translation. In these and other examples, the programs can configure the acceleration engine 1460 with the neural network to execute, and/or can select a neural network processing engine on the acceleration engine 1460 that has previously been configured to execute the desired neural network. In some examples, once the acceleration engine 1460 has started an inference on input data, the host processor 1472 can manage the movement of data (such as weights, instructions, intermediate results, results of conditional layers, and/or final results) into or out of the acceleration engine 1460.

In some examples, a software program that is using the acceleration engine 1460 to conduct an inference can read the result from a conditional layer from the acceleration engine 1460 and/or from a storage location, such as in DRAM 1430. In these examples, the program can determine what action the neural network should take next. For example, the program can determine to terminate the inference. As another example, the program can determine to change the direction of the inference, which can be translated by lower level code and/or the neural network processor to a next layer to execute. In these and other examples, the execution flow of the neural network can be coordinated by software.

The DRAM 1430 is memory that is used by the host processor 1472 for storage of program code that the host processor 1472 is in the process of executing, as well as values that are being operated on. In some examples, the data for a neural network (e.g., weight values, instructions, and other data) can be all or partially stored in the DRAM 1430. DRAM is a common term for processor memory, and though DRAM is volatile memory, processor memory can be volatile and/or non-volatile. Though not illustrated here, the host system 1400 can include other volatile and non-volatile memories for other purposes. For example, the host system 1400 can include a Read-Only Memory (ROM) that stores boot code for booting the host system 1400 at power on, and/or Basic Input/Output System (BIOS) code.

Though not illustrated here, the DRAM 1430 can store instructions for various programs, which can be loaded into and be executed by the host processor 1472. For example, the DRAM 1430 can store instructions for an operating system, one or more data stores, one or more application programs, one or more drivers, and/or services for implementing the features disclosed herein.

The operating system can manage and orchestrate the overall operation of the host system 1400, such as scheduling tasks, executing applications, and/or controlling peripheral devices, among other operations. In some examples, a host system 1400 may host one or more virtual machines. In these examples, each virtual machine may be configured to execute its own operating system. Examples of operating systems include Unix, Linux, Windows, Mac OS, iOS, Android, and the like. The operating system may, alternatively or additionally, be a proprietary operating system.

The data stores can include permanent or transitory data used and/or operated on by the operating system, application programs, or drivers. Examples of such data include web pages, video data, audio data, images, user data, and so on. The information in the data stores may, in some examples, be provided over the network(s) to user devices. In some cases, the data stores may additionally or alternatively include stored application programs and/or drivers. Alternatively or additionally, the data stores may store standard and/or proprietary software libraries, and/or standard and/or proprietary application programming interface (API) libraries. Information stored in the data stores may be machine-readable object code, source code, interpreted code, or intermediate code.

The drivers can include programs that provide communication between components in the host system 1400. For example, some drivers can provide communication between the operating system and peripheral devices or I/O devices 1432. Alternatively or additionally, some drivers may provide communication between application programs and the operating system, and/or application programs and peripheral devices accessible to the host system 1400. In many cases, the drivers can include drivers that provide well-understood functionality (e.g., printer drivers, display drivers, hard disk drivers, Solid State Device drivers, etc.). In other cases, the drivers may provide proprietary or specialized functionality.

The I/O devices 1432 can include hardware for connecting to user input and output devices, such as keyboards, mice, pens, tablets, voice input devices, touch input devices, displays or monitors, speakers, and printers, among other devices. The I/O devices 1432 can also include storage drives and/or network interfaces for connecting to a network 1480. For example, the host system 1400 can use a network interface to communicate with storage devices, user terminals, other computing devices or servers, and/or other networks, among various examples.

In various examples, one or more of the I/O devices 1432 can be storage devices. In these examples, the storage devices include non-volatile memory and can store program instructions and/or data. Examples of storage devices include magnetic storage, optical disks, solid state disks, flash memory, and/or tape storage, among others. The storage device can be housed in the same chassis as the host system 1400 or may be in an external enclosure. A storage device can be fixed (e.g., attached by screws) or removable (e.g., having a physical release mechanism and possibly a hot-plug mechanism).

Storage devices, the DRAM 1430, and any other memory component in the host system 1400 are examples of computer-readable storage media. Computer-readable storage media are physical media that are capable of storing data in a format that can be read by a device such as the host processor 1472. Computer-readable storage media can be non-transitory. Non-transitory computer-readable media can retain the data stored thereon when no power is applied to the media. Examples of non-transitory computer-readable media include ROM devices, magnetic disks, magnetic tape, optical disks, flash devices, and solid state drives, among others. As used herein, computer-readable storage media does not include computer-readable communication media.

In various examples, the data stored on computer-readable storage media can include program instructions, data structures, program modules, libraries, other software program components, and/or other data that can be transmitted within a data signal, such as a carrier wave or other transmission. The computer-readable storage media can, additionally or alternatively, include documents, images, video, audio, and other data that can be operated on or manipulated through the use of a software program.

In various examples, one or more of the I/O devices 1432 can be PCI-based devices. In these examples, a PCI-based I/O device includes a PCI interface for communicating with the host system 1400. The term “PCI” or “PCI-based” may be used to describe any protocol in the PCI family of bus protocols, including the original PCI standard, PCI-X, Accelerated Graphics Port (AGP), and PCI-Express (PCIe), or any other improvement or derived protocols that are based on the PCI protocols discussed herein. The PCI-based protocols are standard bus protocols for connecting devices, such as a local peripheral device, to a host device. A standard bus protocol is a data transfer protocol for which a specification has been defined and adopted by various manufacturers. Manufacturers ensure that compliant devices are compatible with computing systems implementing the bus protocol, and vice versa. As used herein, PCI-based devices also include devices that communicate using Non-Volatile Memory Express (NVMe). NVMe is a device interface specification for accessing non-volatile storage media attached to a computing system using PCIe.

A PCI-based device can include one or more functions. A “function” describes the hardware and/or software of an operation that may be provided by the PCI-based device. Examples of functions include mass storage controllers, network controllers, display controllers, memory controllers, serial bus controllers, wireless controllers, and encryption and decryption controllers, among others. In some cases, a PCI-based device may include more than one function. For example, a PCI-based device may provide a mass storage controller and a network adapter. As another example, a PCI-based device may provide two storage controllers, to control two different storage resources. In some implementations, a PCI-based device may have up to eight functions.

In some examples, the PCI-based device can include single-root I/O virtualization (SR-IOV). SR-IOV is an extended capability that may be included in a PCI-based device. SR-IOV allows a physical resource (e.g., a single network interface controller) to appear as multiple virtual resources (e.g., sixty-four network interface controllers). Thus, a PCI-based device providing a certain functionality (e.g., a network interface controller) may appear to a device making use of the PCI-based device to be multiple devices providing the same functionality. The functions of an SR-IOV-capable storage adapter device may be classified as physical functions (PFs) or virtual functions (VFs). Physical functions are fully featured functions of the device that can be discovered, managed, and manipulated. Physical functions have configuration resources that can be used to configure or control the storage adapter device. Physical functions include the same configuration address space and memory address space that a non-virtualized device would have. A physical function may have a number of virtual functions associated with it. Virtual functions are similar to physical functions, but are light-weight functions that may generally lack configuration resources, and are generally controlled by the configuration of their underlying physical functions. Each of the physical functions and/or virtual functions may be assigned to a respective thread of execution (such as, for example, a virtual machine) running on a host device.

In various implementations, the support systems 1474 can include hardware for coordinating the operations of the acceleration engine 1460. For example, the support systems 1474 can include a microprocessor that coordinates the activities of the acceleration engine 1460, including moving data around on the acceleration engine 1460. In this example, the microprocessor can be an integrated circuit that can execute microcode. Microcode is program code that can enable an integrated circuit to have some flexibility in the operations that the integrated circuit can execute, but because the program code uses a limited instruction set, the microprocessor may have more limited capability than the host processor 1472. In some examples, the program executed by the microprocessor is stored on the hardware of the microprocessor, or on a non-volatile memory chip in the host system 1400. In some examples, the microprocessor and the acceleration engine 1460 can be on one chip, such as one integrated circuit on the same die and in the same package.

In some examples, the support systems 1474 can be responsible for taking instructions from the host processor 1472 when programs executing on the host processor 1472 request the execution of a neural network. For example, the host processor 1472 can provide the support systems 1474 with a set of input data and a task that is to be performed on the set of input data. In this example, the support systems 1474 can identify a neural network that can perform the task, and can program the acceleration engine 1460 to execute the neural network on the set of input data. In some examples, the support systems 1474 only need to select an appropriate neural network processing engine of the neural network processor. In some examples, the support systems 1474 may need to load the data for the neural network onto the acceleration engine 1460 before the acceleration engine 1460 can start executing the neural network. In these and other examples, the support systems 1474 can further receive the output of executing the neural network, and provide the output back to the host processor 1472.

In some examples, the operations of the support systems 1474 can be handled by the host processor 1472. In these examples, the support systems 1474 may not be needed and can be omitted from the host system 1400.

In various examples, the host system 1400 can include a combination of host systems, processor nodes, storage subsystems, and I/O chassis that represent user devices, service provider computers, or third party computers.

User devices can include computing devices to access an application (e.g., a web browser or mobile device application). In some examples, the application may be hosted, managed, and/or provided by a computing resources service or service provider. The application may enable a user to interact with the service provider computer to, for example, access web content (e.g., web pages, music, video, etc.). The user device may be a computing device such as, for example, a mobile phone, a smart phone, a personal digital assistant (PDA), a laptop computer, a netbook computer, a desktop computer, a thin-client device, a tablet computer, an electronic book (e-book) reader, a gaming console, etc. In some examples, the user device may be in communication with the service provider computer over one or more networks. Additionally, the user device may be part of the distributed system managed by, controlled by, or otherwise part of the service provider computer (e.g., a console device integrated with the service provider computers).

The host system 1400 can also represent one or more service provider computers. A service provider computer may provide a native application that is configured to run on user devices, which users may interact with. The service provider computer may, in some examples, provide computing resources such as, but not limited to, client entities, low latency data storage, durable data storage, data access, management, virtualization, cloud-based software solutions, electronic content performance management, and so on. The service provider computer may also be operable to provide web hosting, databasing, computer application development and/or implementation platforms, combinations of the foregoing or the like. In some examples, the service provider computer may be provided as one or more virtual machines implemented in a hosted computing environment. The hosted computing environment can include one or more rapidly provisioned and released computing resources. These computing resources can include computing, networking, and/or storage devices. A hosted computing environment may also be referred to as a cloud computing environment. The service provider computer may include one or more servers, perhaps arranged in a cluster, as a server farm, or as individual servers not associated with one another, and may host application and/or cloud-based software services. These servers may be configured as part of an integrated, distributed computing environment. In some examples, the service provider computer may, additionally or alternatively, include computing devices such as, for example, a mobile phone, a smart phone, a personal digital assistant (PDA), a laptop computer, a desktop computer, a netbook computer, a server computer, a thin-client device, a tablet computer, a gaming console, etc. In some instances, the service provider computer may communicate with one or more third party computers.

FIG. 15 illustrates an example network 1500, which can include one or more host systems, such as the host system illustrated in FIG. 14. For example, the example network 1500 of FIG. 15 includes multiple nodes 1502a-1502h, one or more of which can be a host system such as is illustrated in FIG. 14. Others of the nodes 1502a-1502h can be other computing devices, each of which includes at least a memory for storing program instructions, a processor for executing the instructions, and a network interface for connecting to the network 1500.

In various examples, the network 1500 can be used to process data. For example, input data can be received at one of the nodes 1502a-1502h or from other networks 1508 with which the network 1500 can communicate. In this example, the input data can be directed to a node in the network 1500 that includes an acceleration engine, for the acceleration engine to operate on and produce a result. The result can then be transferred to the node or other network from which the input data was received. In various examples, input data can be accumulated from various sources, including one or more of the nodes 1502a-1502h and/or computing devices located in the other networks 1508, and the accumulated input data can be directed to one or more host systems in the network 1500. Results from the host systems can then be distributed back to the sources from which the input data was gathered.

In various examples, one or more of the nodes 1502a-1502h can be responsible for operations such as accumulating input data for host systems to operate on, keeping track of which host systems are busy and which can accept more work, determining whether the host systems are operating correctly and/or most efficiently, monitoring network security, and/or other management operations.

In the example of FIG. 15, the nodes 1502a-1502h are connected to one another using a switched architecture with point-to-point links. The switched architecture includes multiple switches 1504a-1504d, which can be arranged in a multi-layered network such as a Clos network. A network device that filters and forwards packets between local area network (LAN) segments may be referred to as a switch. Switches generally operate at the data link layer (layer 2) and sometimes the network layer (layer 3) of the Open Systems Interconnection (OSI) Reference Model and may support several packet protocols. The switches 1504a-1504d of FIG. 15 may be connected to the nodes 1502a-1502h and provide multiple paths between any two nodes.

The network 1500 may also include one or more network devices for connection with other networks 1508, such as a router 1506. Routers use headers and forwarding tables to determine the best path for forwarding the packets, and use protocols such as the internet control message protocol (ICMP) to communicate with each other and configure the best route between any two devices. The router 1506 of FIG. 15 can be used to connect to other networks 1508 such as subnets, LANs, wide area networks (WANs), and/or the Internet.

In some examples, the network 1500 may include any one or a combination of many different types of networks, such as cable networks, the Internet, wireless networks, cellular networks, and other private and/or public networks. The interconnected switches 1504a-1504d and the router 1506, if present, may be referred to as a switch fabric 1510, a fabric, a network fabric, or simply a network. In the context of a computer network, the terms “fabric” and “network” may be used interchangeably herein.

The nodes 1502a-1502h may be any combination of host systems, processor nodes, storage subsystems, and I/O chassis that represent user devices, service provider computers, or third party computers.

User devices may include computing devices to access an application 1532 (e.g., a web browser or mobile device application). In some aspects, the application 1532 may be hosted, managed, and/or provided by a computing resources service or service provider. The application 1532 may allow the user(s) to interact with the service provider computer(s) to, for example, access web content (e.g., web pages, music, video, etc.). The user device(s) may be a computing device such as, for example, a mobile phone, a smart phone, a personal digital assistant (PDA), a laptop computer, a netbook computer, a desktop computer, a thin-client device, a tablet computer, an electronic book (e-book) reader, a gaming console, etc. In some examples, the user device(s) may be in communication with the service provider computer(s) via the other network(s) 1508. Additionally, the user device(s) may be part of the distributed system managed by, controlled by, or otherwise part of the service provider computer(s) (e.g., a console device integrated with the service provider computers).

The node(s) of FIG. 15 may also represent one or more service provider computers. One or more service provider computers may provide a native application that is configured to run on the user devices, which user(s) may interact with. The service provider computer(s) may, in some examples, provide computing resources such as, but not limited to, client entities, low latency data storage, durable data storage, data access, management, virtualization, cloud-based software solutions, electronic content performance management, and so on. The service provider computer(s) may also be operable to provide web hosting, databasing, computer application development and/or implementation platforms, combinations of the foregoing or the like to the user(s). In some examples, the service provider computer(s) may be provided as one or more virtual machines implemented in a hosted computing environment. The hosted computing environment may include one or more rapidly provisioned and released computing resources. These computing resources may include computing, networking and/or storage devices. A hosted computing environment may also be referred to as a cloud computing environment. The service provider computer(s) may include one or more servers, perhaps arranged in a cluster, as a server farm, or as individual servers not associated with one another, and may host the application 1532 and/or cloud-based software services. These servers may be configured as part of an integrated, distributed computing environment. In some aspects, the service provider computer(s) may, additionally or alternatively, include computing devices such as, for example, a mobile phone, a smart phone, a personal digital assistant (PDA), a laptop computer, a desktop computer, a netbook computer, a server computer, a thin-client device, a tablet computer, a gaming console, etc. In some instances, the service provider computer(s) may communicate with one or more third party computers.

In one example configuration, the node(s) 1502a-1502h may include at least one memory 1518 and one or more processing units (or processor(s) 1520). The processor(s) 1520 may be implemented in hardware, computer-executable instructions, firmware, or combinations thereof. Computer-executable instruction or firmware implementations of the processor(s) 1520 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described.

In some instances, the hardware processor(s) 1520 may be a single core processor or a multi-core processor. A multi-core processor may include multiple processing units within the same processor. In some examples, the multi-core processors may share certain resources, such as buses and second or third level caches. In some instances, each core in a single or multi-core processor may also include multiple executing logical processors (or executing threads). In such a core (e.g., those with multiple logical processors), several stages of the execution pipeline and also lower level caches may also be shared.

The memory 1518 may store program instructions that are loadable and executable on the processor(s) 1520, as well as data generated during the execution of these programs. Depending on the configuration and type of the node(s) 1502a-1502h, the memory 1518 may be volatile (such as RAM) and/or non-volatile (such as ROM, flash memory, etc.). The memory 1518 may include an operating system 1528, one or more data stores 1530, one or more application programs 1532, one or more drivers 1534, and/or services for implementing the features disclosed herein.

The operating system 1528 may support the basic functions of the nodes 1502a-1502h, such as scheduling tasks, executing applications, and/or controlling peripheral devices. In some implementations, a service provider computer may host one or more virtual machines. In these implementations, each virtual machine may be configured to execute its own operating system. Examples of operating systems include Unix, Linux, Windows, Mac OS, iOS, Android, and the like. The operating system 1528 may also be a proprietary operating system.

The data stores 1530 may include permanent or transitory data used and/or operated on by the operating system 1528, application programs 1532, or drivers 1534. Examples of such data include web pages, video data, audio data, images, user data, and so on. The information in the data stores 1530 may, in some implementations, be provided over the network(s) 1508 to user devices. In some cases, the data stores 1530 may additionally or alternatively include stored application programs and/or drivers. Alternatively or additionally, the data stores 1530 may store standard and/or proprietary software libraries, and/or standard and/or proprietary application programming interface (API) libraries. Information stored in the data stores 1530 may be machine-readable object code, source code, interpreted code, or intermediate code.

The drivers 1534 include programs that may provide communication between components in a node. For example, some drivers 1534 may provide communication between the operating system 1528 and additional storage 1522, network device 1524, and/or I/O device 1526. Alternatively or additionally, some drivers 1534 may provide communication between application programs 1532 and the operating system 1528, and/or application programs 1532 and peripheral devices accessible to the service provider computer. In many cases, the drivers 1534 may include drivers that provide well-understood functionality (e.g., printer drivers, display drivers, hard disk drivers, Solid State Device drivers). In other cases, the drivers 1534 may provide proprietary or specialized functionality.

The service provider computer(s) or servers may also include additional storage 1522, which may include removable storage and/or non-removable storage. The additional storage 1522 may include magnetic storage, optical disks, solid state disks, flash memory, and/or tape storage. The additional storage 1522 may be housed in the same chassis as the node(s) 1502a-1502h or may be in an external enclosure. The memory 1518 and/or additional storage 1522 and their associated computer-readable media may provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computing devices. In some implementations, the memory 1518 may include multiple different types of memory, such as SRAM, DRAM, or ROM.

The memory 1518 and the additional storage 1522, both removable and non-removable, are examples of computer-readable storage media. For example, computer-readable storage media may include volatile or non-volatile, removable or non-removable media implemented in a method or technology for storage of information, the information including, for example, computer-readable instructions, data structures, program modules, or other data. The memory 1518 and the additional storage 1522 are examples of computer storage media. Additional types of computer storage media that may be present in the node(s) 1502a-1502h may include, but are not limited to, PRAM, SRAM, DRAM, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, solid state drives, or some other medium which can be used to store the desired information and which can be accessed by the node(s) 1502a-1502h. Computer-readable media also includes combinations of any of the above media types, including multiple units of one media type.

Alternatively or additionally, computer-readable communication media may include computer-readable instructions, program modules, or other data transmitted within a data signal, such as a carrier wave or other transmission. However, as used herein, computer-readable storage media does not include computer-readable communication media.

The node(s) 1502a-1502h may also include I/O device(s) 1526, such as a keyboard, a mouse, a pen, a voice input device, a touch input device, a display, speakers, a printer, and the like. The node(s) 1502a-1502h may also include one or more communication channels 1536. A communication channel 1536 may provide a medium over which the various components of the node(s) 1502a-1502h can communicate. The communication channel or channels 1536 may take the form of a bus, a ring, a switching fabric, or a network.

The node(s) 1502a-1502h may also contain network device(s) 1524 that allow the node(s) 1502a-1502h to communicate with a stored database, another computing device or server, user terminals, and/or other devices on the network(s) 1500.

In some implementations, the network device 1524 is a peripheral device, such as a PCI-based device. In these implementations, the network device 1524 includes a PCI interface for communicating with a host device. The term “PCI” or “PCI-based” may be used to describe any protocol in the PCI family of bus protocols, including the original PCI standard, PCI-X, Accelerated Graphics Port (AGP), and PCI-Express (PCIe), or any other improvement or derived protocols that are based on the PCI protocols discussed herein. The PCI-based protocols are standard bus protocols for connecting devices, such as a local peripheral device, to a host device. A standard bus protocol is a data transfer protocol for which a specification has been defined and adopted by various manufacturers. Manufacturers ensure that compliant devices are compatible with computing systems implementing the bus protocol, and vice versa. As used herein, PCI-based devices also include devices that communicate using Non-Volatile Memory Express (NVMe). NVMe is a device interface specification for accessing non-volatile storage media attached to a computing system using PCIe. For example, the bus interface module may implement NVMe, and the network device 1524 may be connected to a computing system using a PCIe interface.

A PCI-based device may include one or more functions. A “function” describes operations that may be provided by the network device 1524. Examples of functions include mass storage controllers, network controllers, display controllers, memory controllers, serial bus controllers, wireless controllers, and encryption and decryption controllers, among others. In some cases, a PCI-based device may include more than one function. For example, a PCI-based device may provide a mass storage controller and a network adapter. As another example, a PCI-based device may provide two storage controllers, to control two different storage resources. In some implementations, a PCI-based device may have up to eight functions.

In some implementations, the network device 1524 may include single-root I/O virtualization (SR-IOV). SR-IOV is an extended capability that may be included in a PCI-based device. SR-IOV allows a physical resource (e.g., a single network interface controller) to appear as multiple resources (e.g., sixty-four network interface controllers). Thus, a PCI-based device providing a certain functionality (e.g., a network interface controller) may appear to a device making use of the PCI-based device to be multiple devices providing the same functionality. The functions of an SR-IOV-capable storage adapter device may be classified as physical functions (PFs) or virtual functions (VFs). Physical functions are fully featured functions of the device that can be discovered, managed, and manipulated. Physical functions have configuration resources that can be used to configure or control the storage adapter device. Physical functions include the same configuration address space and memory address space that a non-virtualized device would have. A physical function may have a number of virtual functions associated with it. Virtual functions are similar to physical functions, but are light-weight functions that may generally lack configuration resources, and are generally controlled by the configuration of their underlying physical functions. Each of the physical functions and/or virtual functions may be assigned to a respective thread of execution (such as, for example, a virtual machine) running on a host device.

The modules described herein may be software modules, hardware modules, or a suitable combination thereof. If the modules are software modules, the modules can be embodied on a non-transitory computer readable medium and processed by a processor in any of the computer systems described herein. It should be noted that the described processes and architectures can be performed either in real-time or in an asynchronous mode prior to any user interaction. The modules may be configured in the manner suggested in the preceding figures, and/or functions described herein can be provided by one or more modules that exist as separate modules and/or module functions described herein can be spread over multiple modules.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.

Other variations are within the spirit of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated examples thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the disclosure to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the disclosure, as defined in the appended claims.

The use of the terms “a” and “an” and “the” and similar referents in the context of describing the disclosed examples (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. The term “connected” is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate examples of the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.

Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is intended to be understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain examples require at least one of X, at least one of Y, or at least one of Z to each be present.

Various examples of this disclosure are described herein, including the best mode known to the inventors for carrying out the disclosure. Variations of those examples may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the disclosure to be practiced otherwise than as specifically described herein. Accordingly, this disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

What is claimed is:
1. A method of exchanging compressed gradient data within a distributed system for training a neural network model, the method comprising: computing, at a transmitting worker node of the distributed system, a set of gradients using the neural network model and a set of weights associated with the neural network model; performing, at the transmitting worker node, a sparsity analysis on the set of gradients to determine a threshold; clipping, at the transmitting worker node, each of the set of gradients having a value less than the threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating, at the transmitting worker node, a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating, at the transmitting worker node, compressed data comprising the non-clipped data elements from the set of gradients; transmitting the mapping and the compressed data from the transmitting worker node to a receiving worker node of the distributed system; generating, at the receiving worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping, such that the decompressed data includes the set of gradients comprising the non-clipped data elements and the clipped data elements; and computing, at the receiving worker node, a set of synchronized gradients based on the set of gradients and other gradients received at the receiving worker node.
2. The method of claim 1, further comprising: forming, at the transmitting worker node, a header comprising the mapping and an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements.
3. The method of claim 1, wherein the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
4. The method of claim 1, wherein clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
5. A method comprising: computing, at a first worker node of a distributed system, a set of gradients using a neural network model and a set of weights associated with the neural network model; clipping each of the set of gradients having a value less than a threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating compressed data based on the non-clipped data elements from the set of gradients; and transmitting the mapping and the compressed data from the first worker node to a second worker node of the distributed system.
6. The method of claim 5, further comprising: generating, at the second worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping to obtain the set of gradients comprising the non-clipped data elements and the clipped data elements.
7. The method of claim 5, further comprising: forming a header comprising the mapping, wherein the header and the compressed data are transmitted from the first worker node to the second worker node.
8. The method of claim 7, wherein the header further comprises an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements.
9. The method of claim 5, wherein the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
10. The method of claim 5, wherein clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
11. The method of claim 5, further comprising: performing a sparsity analysis on the set of gradients to determine the threshold.
12. The method of claim 11, wherein performing the sparsity analysis includes: calculating an average for the set of gradients; calculating a standard deviation for the set of gradients; and determining the threshold based on the average and the standard deviation.
13. A non-transitory computer-readable medium having stored therein instructions that, when executed by one or more processors, cause the one or more processors to perform operations including: computing, at a first worker node of a distributed system, a set of gradients using a neural network model and a set of weights associated with the neural network model; clipping each of the set of gradients having a value less than a threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating compressed data based on the non-clipped data elements from the set of gradients; and transmitting the mapping and the compressed data from the first worker node to a second worker node of the distributed system.
14. The non-transitory computer-readable medium of claim 13, wherein the operations further comprise: generating, at the second worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping to obtain the set of gradients comprising the non-clipped data elements and the clipped data elements.
15. The non-transitory computer-readable medium of claim 13, wherein the operations further comprise: forming a header comprising the mapping, wherein the header and the compressed data are transmitted from the first worker node to the second worker node.
16. The non-transitory computer-readable medium of claim 15, wherein the header further comprises an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements.
17. The non-transitory computer-readable medium of claim 13, wherein the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
18. The non-transitory computer-readable medium of claim 13, wherein clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
19. The non-transitory computer-readable medium of claim 13, wherein the operations further comprise: performing a sparsity analysis on the set of gradients to determine the threshold.
20. The non-transitory computer-readable medium of claim 19, wherein performing the sparsity analysis includes: calculating an average for the set of gradients; calculating a standard deviation for the set of gradients; and determining the threshold based on the average and the standard deviation.
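By way of non-limiting illustration, the following Python/NumPy sketch shows one possible transmit-side implementation of the operations recited in claims 1, 5, 11-12, and 19-20: a sparsity analysis deriving a threshold from the average and the standard deviation of the gradients, clipping of sub-threshold gradients to zero, a bitmap mapping, compressed data holding only the non-clipped elements, and a header carrying the mapping and the original length. The function names, the use of gradient magnitudes for the comparison, the dictionary header layout, and the specific threshold rule (mean plus one standard deviation) are assumptions made for this sketch, not requirements of the claims.

    import numpy as np

    def sparsity_threshold(gradients: np.ndarray) -> float:
        # Sparsity analysis (claims 11-12, 19-20): derive the threshold from
        # the average and the standard deviation of the set of gradients.
        # Using magnitudes and a mean-plus-one-standard-deviation rule is an
        # assumption of this sketch; the claims only require that both
        # statistics inform the threshold.
        magnitudes = np.abs(gradients)
        return float(magnitudes.mean() + magnitudes.std())

    def compress_gradients(gradients: np.ndarray):
        # Clip each gradient whose magnitude falls below the threshold by
        # setting it to zero (claims 1, 4, 5, 10, 18), yielding clipped
        # (zero) and non-clipped (non-zero) data elements.
        threshold = sparsity_threshold(gradients)
        keep = np.abs(gradients) >= threshold
        clipped = np.where(keep, gradients, 0.0)
        # Bitmap mapping (claims 3, 9, 17): 1 marks a non-clipped element,
        # 0 marks a clipped element.
        mapping = keep.astype(np.uint8)
        # Compressed data holds only the non-clipped elements (claims 1, 5).
        compressed = clipped[keep]
        # Header carrying the mapping and the original length (claims 2, 7-8,
        # 15-16).
        header = {"mapping": mapping, "original_length": gradients.size}
        return header, compressed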
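On the receive side, the decompression of claims 1, 6, and 14 can be sketched as follows: the non-clipped elements carried in the compressed data are scattered back to the positions flagged in the mapping, and the clipped positions are restored as zeros. The header layout matches the assumption made in the transmit-side sketch above.

    def decompress_gradients(header: dict, compressed: np.ndarray) -> np.ndarray:
        # Rebuild the full set of gradients (claims 1, 6, 14): non-clipped
        # elements return to the positions flagged in the mapping, and
        # clipped positions are restored as zero data elements.
        restored = np.zeros(header["original_length"], dtype=compressed.dtype)
        restored[header["mapping"].astype(bool)] = compressed
        return restored

For example, decompress_gradients(*compress_gradients(g)) reproduces the array g with every sub-threshold element set to zero, which is exactly the set of gradients comprising the non-clipped and clipped data elements.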
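Finally, claim 1 recites computing a set of synchronized gradients at the receiving worker node from the decompressed gradients and other gradients received there. One common realization, assumed here purely for illustration, is an element-wise average across all worker nodes; the claim does not fix the reduction operation.

    def synchronize_gradients(local: np.ndarray, received: list) -> np.ndarray:
        # Combine the locally decompressed gradients with gradients received
        # from other worker nodes (claim 1). Element-wise averaging is an
        # assumption of this sketch.
        return np.mean(np.stack([local] + received), axis=0)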